{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 6 Solutions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.1\n",
    "### a)\n",
    "The marginal distributions are obtained by summing the probabilies over all the values of the variable being marginalized. Thus, to obtain $p(x)$ we sum over columns (i.e., over the values corresponding to different $y$):\n",
    "\n",
    "$\n",
    "\\begin{align}\n",
    "p(x_1) &= P(X = x_1) = P(X = x_1, Y = y_1) + P(X = x_1, Y = y_2) + P(X = x_1, Y = y_3) = 0.01 + 0.05 + 0.1 = 0.16 \\\\\n",
    "p(x_2) &= P(X = x_2) = P(X = x_2, Y = y_1) + P(X = x_2, Y = y_2) + P(X = x_2, Y = y_3) = 0.02 + 0.1 + 0.05 = 0.17\\\\\n",
    "p(x_3) &= P(X = x_3) = P(X = x_3, Y = y_1) + P(X = x_3, Y = y_2) + P(X = x_3, Y = y_3) = 0.03 + 0.05 + 0.03 = 0.11\\\\\n",
    "p(x_4) &= P(X = x_4) = P(X = x_4, Y = y_1) + P(X = x_4, Y = y_2) + P(X = x_4, Y = y_3) = 0.1 + 0.07 + 0.05 = 0.22\\\\\n",
    "p(x_5) &= P(X = x_5) = P(X = x_5, Y = y_1) + P(X = x_5, Y = y_2) + P(X = x_5, Y = y_3) = 0.1 + 0.2 + 0.04 = 0.34\n",
    "\\end{align}\n",
    "$\n",
    "\n",
    "As a correctness check, note that this distribution satisfies the normalization condition, i.e. that sum of the probabilities is $1$:\n",
    "\n",
    "$\n",
    "\\begin{equation}\n",
    "\\sum_{i=1}^5 p(x_i) = 1\n",
    "\\end{equation}\n",
    "$\n",
    "\n",
    "The marginal distribution $p(y)$ can be obtained in a similar way, by summing the matrix rows:\n",
    "\n",
    "$\n",
    "\\begin{align}\n",
    "p(y_1) &= P(Y = y_1) = \\sum_{i=1}^5 P(X = x_i, Y = y_1) = 0.01 + 0.02 + 0.03 + 0.1 + 0.1 = 0.26 \\\\\n",
    "p(y_2) &= P(Y = y_2) = \\sum_{i=1}^5 P(X = x_i, Y = y_2) = 0.05 + 0.1 + 0.05 + 0.07 + 0.2 = 0.47 \\\\\n",
    "p(y_3) &= P(Y = y_3) = \\sum_{i=1}^5 P(X = x_i, Y = y_3) = 0.1 + 0.05 + 0.03 + 0.05 + 0.04 = 0.27\n",
    "\\end{align}\n",
    "$\n",
    "\n",
    "We can again check that the normalization condition is satisfied:\n",
    "\n",
    "$\n",
    "\\begin{equation}\n",
    "\\sum_{i=1}^3p(y_i) = 1\n",
    "\\end{equation}\n",
    "$\n",
    "\n",
    "### b)\n",
    "To determine conditional distributions we use the definition of the conditional probability:\n",
    "\n",
    "$\n",
    "P(X = x , Y = y_1) = P(X = x | Y = y_1)P(Y = y_1) = p(x | Y = y_1) p(y_1).\n",
    "$\n",
    "\n",
    "Thus,\n",
    "\n",
    "$\n",
    "p(x_1 | Y = y_1) = \\frac{P(X = x_1, Y = y_1)}{p(y_1)} = \\frac{0.01}{0.26} \\approx 0.038\\\\\n",
    "p(x_2 | Y = y_1) = \\frac{P(X = x_2, Y = y_1)}{p(y_1)} = \\frac{0.02}{0.26} \\approx 0.077\\\\\n",
    "p(x_3 | Y = y_1) = \\frac{P(X = x_3, Y = y_1)}{p(y_1)} = \\frac{0.03}{0.26} \\approx 0.115\\\\\n",
    "p(x_4 | Y = y_1) = \\frac{P(X = x_4, Y = y_1)}{p(y_1)} = \\frac{0.1}{0.26} \\approx 0.385\\\\\n",
    "p(x_5 | Y = y_1) = \\frac{P(X = x_5, Y = y_1)}{p(y_1)} = \\frac{0.1}{0.26} \\approx 0.385\n",
    "$\n",
    "\n",
    "Likewise the conditional distribution $p(y | X = x_3)$ is given by\n",
    "\n",
    "$\n",
    "p(y_1 | X = y_3) = \\frac{P(X = x_3, Y = y_1)}{p(x_3)} = \\frac{0.03}{0.11} \\approx 0.273\\\\\n",
    "p(y_2 | X = y_3) = \\frac{P(X = x_3, Y = y_2)}{p(x_3)} = \\frac{0.05}{0.11} \\approx 0.454\\\\\n",
    "p(y_3 | X = y_3) = \\frac{P(X = x_3, Y = y_3)}{p(x_3)} = \\frac{0.03}{0.11} \\approx 0.273\n",
    "$\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.2\n",
    "\n",
    "### a)\n",
    "We can write the probability density of the two-dimensional distribution as\n",
    "\n",
    "$\n",
    "p(x,y)= \n",
    "0.4\\mathcal{N}\\left(x, y|\\begin{bmatrix} 10\\\\ 2\\end{bmatrix}, \\begin{bmatrix} 1&0\\\\0&1\\end{bmatrix}\\right)+\n",
    "0.6\\mathcal{N}\\left(x, y|\\begin{bmatrix} 0\\\\ 0\\end{bmatrix}, \\begin{bmatrix} 8.4&2.0\\\\2.0&1.7\\end{bmatrix}\\right)\n",
    "$\n",
    "\n",
    "The marginal distribution of a weighted sum of distributions is given by the weighted sum of marginals, whereas the marginals of a bivariate normal distribution $\\mathcal{N}(x,y|\\mathbf{\\mu},\\mathbf{\\Sigma})$ are obtained according to the rule\n",
    "\n",
    "$\n",
    "\\int \\mathcal{N}(x,y|\\mathbf{\\mu},\\mathbf{\\Sigma})dy= \n",
    "\\mathcal{N}(x|\\mu_x, \\Sigma_{xx}), \\\\\n",
    "\\int \\mathcal{N}(x,y|\\mathbf{\\mu},\\mathbf{\\Sigma})dx = \\mathcal{N}(y|\\mu_y, \\Sigma_{yy}) \n",
    "$\n",
    "\n",
    "Thus, the marginals of the distribution of interest are\n",
    "\n",
    "$\n",
    "p(x) = 0.4\\mathcal{N}(x| 10, 1) + 0.6\\mathcal{N}(x| 0, 8.4),\\\\\n",
    "p(y) = 0.4\\mathcal{N}(x| 2, 1) + 0.6\\mathcal{N}(x| 0, 1.7)\n",
    "$\n",
    "\n",
    "### b)\n",
    "The mean of a weighted sum of two distributions is the weighted sum of their averages\n",
    "\n",
    "$\n",
    "\\mathbb{E}_X[x] = 0.4*10 + 0.6*0 = 4,\\\\\n",
    "\\mathbb{E}_Y[y] = 0.4*2 + 0.6*0 = 0.8\n",
    "$\n",
    "\n",
    "The mode of a continuous distribution is a point where this distribution has a peak. It can be determined by solving the extremum condition for each of the marginal distributions:\n",
    "\n",
    "$\n",
    "\\frac{dp(x)}{dx} = 0,\\\\\n",
    "\\frac{dp(y)}{dy} = 0\n",
    "$\n",
    "\n",
    "In the case of a mixture of normal distributions these equations are non-linear and can be solved only numerically. After finding all the solutions of these equations one has to verify for every solution that it is a peak rather than an inflection point, i.e. that at this point\n",
    "\n",
    "$\n",
    "\\frac{d^2p(x)}{dx^2} < 0 \\text{ or } \\frac{d^2p(y)}{dy^2} < 0\n",
    "$\n",
    "\n",
    "The medians $m_x, m_y$ can be determined from the conditions\n",
    "\n",
    "$\n",
    "\\int_{-\\infty}^{m}p(x)dx = \\int^{+\\infty}_{m}p(x)dx,\\\\\n",
    "\\int_{-\\infty}^{m}p(y)dy = \\int^{+\\infty}_{m}p(y)dy\n",
    "$\n",
    "\n",
    "Again, these equations can be solved here only numerically.\n",
    "\n",
    "### c)\n",
    "The mean of a two-dimensional distribution is a vector of means of the marginal distributions\n",
    "\n",
    "$\n",
    "\\mathbf{\\mu} = \\begin{bmatrix}4\\\\0.8\\end{bmatrix}\n",
    "$\n",
    "\n",
    "The mode of two dimensional distribution is obtained first by solving the extremum conditions\n",
    "\n",
    "$\n",
    "\\frac{\\partial p(x,y)}{\\partial x} = 0, \\frac{\\partial p(x,y)}{\\partial y} = 0\n",
    "$\n",
    "\n",
    "and then verifying for every solution that it is indeed a peak, i.e.\n",
    "\n",
    "$\n",
    "\\frac{\\partial^2 p(x,y)}{\\partial x^2} < 0, \\frac{\\partial^2 p(x,y)}{\\partial y^2} < 0,\\\\\n",
    "\\det\\left(\n",
    "\\begin{bmatrix}\n",
    "\\frac{\\partial^2 p(x,y)}{\\partial x^2} & \\frac{\\partial^2 p(x,y)}{\\partial x\\partial y}\\\\\n",
    "\\frac{\\partial^2 p(x,y)}{\\partial x\\partial y} & \\frac{\\partial^2 p(x,y)}{\\partial y^2}\n",
    "\\end{bmatrix}\n",
    "\\right) > 0\n",
    "$\n",
    "\n",
    "Again, these squations can be solved only numerically.\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.3\n",
    "\n",
    "The conjugate prior to the Bernoulli distribution is the Beta distribution\n",
    "\n",
    "$\n",
    "p(\\mu | \\alpha, \\beta) =\\frac{1}{\\mathcal{B}(\\alpha, \\beta)} \\mu^{\\alpha -1}(1-\\mu)^{\\beta-1} \n",
    "\\propto \\mu^{\\alpha -1}(1-\\mu)^{\\beta-1},\n",
    "$\n",
    "\n",
    "where $\\alpha,\\beta$ are not necessarily integers and the normalization coefficient si the Beta function defined as\n",
    "\n",
    "$\n",
    "\\mathcal{B}(\\alpha, \\beta) = \\int_0^1 t^{\\alpha -1}(1-t)^{\\beta-1}dt\n",
    "$\n",
    "\n",
    "The likelihood of observing data $\\{x_1, x_2, ..., x_N\\}$ is \n",
    "\n",
    "$p(x_1, ..., x_N|\\mu) = \\prod_{i=1}^Np(x_i|\\mu) = \\prod_{i=1}^N \\mu^{x_i}(1-\\mu)^{1-x_i} =\n",
    "\\mu^{\\sum_{i=1}^N x_i}(1-\\mu)^{N-\\sum_{i=1}^N x_i}\n",
    "$\n",
    "\n",
    "The posterior distribution is proportional to teh rpoduct of this likelihood with teh prior distribution (Bayes theorem):\n",
    "\n",
    "$\n",
    "p(\\mu |x_1, ..., x_N) \\propto p(x_1, ..., x_N|\\mu)p(\\mu | \\alpha, \\beta) \\propto\n",
    "\\mu^{\\sum_{i=1}^N x_i + \\alpha -1}(1-\\mu)^{N-\\sum_{i=1}^N x_i +\\beta -1}\n",
    "$\n",
    "\n",
    "This is also a Beta distribution, i.e. our choice of the gonjugate prior was correct. The normalization constant is readily determined:\n",
    "\n",
    "$\n",
    "p(\\mu |x_1, ..., x_N) = \\frac{1}{\\mathcal{B}(\\sum_{i=1}^N x_i + \\alpha -1, N-\\sum_{i=1}^N x_i +\\beta -1)}\n",
    "\\mu^{\\sum_{i=1}^N x_i + \\alpha -1}(1-\\mu)^{N-\\sum_{i=1}^N x_i +\\beta -1}\n",
    "$\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.4\n",
    "\n",
    "The probabilities of picking a mango or an apple from teh first bag are given by\n",
    "\n",
    "$\n",
    "p(mango |1) = \\frac{4}{6} = \\frac{2}{3}\\\\\n",
    "p(apple |1) = \\frac{2}{6} = \\frac{1}{3}\n",
    "$\n",
    "\n",
    "The probabilities of picking a mango or an apple from teh second bag are\n",
    "$\n",
    "p(mango |2) = \\frac{4}{8} = \\frac{1}{2}\\\\\n",
    "p(apple |2) = \\frac{4}{8} = \\frac{1}{2}\n",
    "$\n",
    "\n",
    "The probability of picking the first or the second bag are equal to teh probabilities of head and tail respectively:\n",
    "\n",
    "$\n",
    "p(1) = 0.6,\\\\\n",
    "p(2) = 0.4\n",
    "$\n",
    "\n",
    "We now can obtain the probability that the mango was picked from the second bag using Bayes' theorem:\n",
    "\n",
    "$\n",
    "p(2 | mango) = \\frac{p(mango | 2)p(2)}{p(mango)} =\n",
    "\\frac{p(mango | 2)p(2)}{p(mango | 1)p(1) + p(mango | 2)p(2)} =\n",
    "\\frac{\\frac{1}{2}0.4}{\\frac{2}{3}0.6 + \\frac{1}{2}0.4} = \\frac{1}{3}\n",
    "$\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.5\n",
    "\n",
    "### a)\n",
    "\n",
    "$\\mathbf{x}_{t+1}$ is obtained from $\\mathbf{x}_{t}$ by a linear transformation, $\\mathbf{A}\\mathbf{x}_{t}$ and adding a Gaussian random variabme $\\mathbf{w}$. Initial distribution for $\\mathbf{x}_{0}$ is a Gaussian distribution, a linear transformation of a Gaussian random variable is also a Gaussian random variable, whareas a sum of Gaussian random variables is a Gaussian random variable. Thus, the joint distribution $p(\\mathbf{x}_{0}, \\mathbf{x}_{1},...,\\mathbf{x}_{T})$ is also a Gaussian distribution.\n",
    "\n",
    "### b)\n",
    "#### 1)\n",
    "Let $\\mathbf{z} = \\mathbf{A}\\mathbf{x}_{t+1}$. Since this is a linear transformation of a Gaussian random variable, $\\mathbf{x}_t \\sim \\mathcal{N}(\\mathbf{\\mu}_t,\\mathbf{\\Sigma})$, then $\\mathbf{z}$ is distributed as (see Eq. (6.88))\n",
    "\n",
    "$\n",
    "\\mathbf{z} \\sim \\mathcal{N}(\\mathbf{A}\\mathbf{\\mu}_t, \\mathbf{A}\\mathbf{\\Sigma}\\mathbf{A}^T),\n",
    "$\n",
    "\n",
    "whereas the mean and the covariance of a sum of two Gaussian random variables are given by the sum of the means and the covariances of these variables, i.e.,\n",
    "\n",
    "$\n",
    "\\mathbf{x}_{t+1} = \\mathbf{z} + \\mathbf{w} \n",
    "\\sim \n",
    "\\mathcal{N}(\\mathbf{A}\\mathbf{\\mu}_t, \\mathbf{A}\\mathbf{\\Sigma}\\mathbf{A}^T + \\mathbf{Q}),\n",
    "$\n",
    "\n",
    "That is\n",
    "\n",
    "$\n",
    "p(\\mathbf{x}_{t+1}|\\mathbf{y}_1,...,\\mathbf{y}_t)=\n",
    "\\mathcal{N}(\\mathbf{x}_{t+1}|\\mathbf{A}\\mathbf{\\mu}_t, \\mathbf{A}\\mathbf{\\Sigma}\\mathbf{A}^T + \\mathbf{Q}).\n",
    "$\n",
    "\n",
    "#### 2)\n",
    "If we assume that $\\mathbf{x}_{t+1}$ is fixed, then $\\mathbf{y}_{t+1} = \\mathbf{C}\\mathbf{x}_{t+1} + \\mathbf{v}$ follows the same distribution as $\\mathbf{v}$, but with the mean shifted by $\\mathbf{C}\\mathbf{x}_{t+1}$, i.e.\n",
    "\n",
    "$\n",
    "p(\\mathbf{y}_{t+1}|\\mathbf{x}_{t+1}, \\mathbf{y}_1,...,\\mathbf{y}_t)=\n",
    "\\mathcal{N}(\\mathbf{y}_{t+1}|\\mathbf{C}\\mathbf{x}_{t+1}, \\mathbf{R}).\n",
    "$\n",
    "\n",
    "The the joint probability is obtained as\n",
    "\n",
    "$\n",
    "p(\\mathbf{y}_{t+1}, \\mathbf{x}_{t+1}| \\mathbf{y}_1,...,\\mathbf{y}_t)=\n",
    "p(\\mathbf{y}_{t+1}|\\mathbf{x}_{t+1}, \\mathbf{y}_1,...,\\mathbf{y}_t)\n",
    "p(\\mathbf{x}_{t+1}| \\mathbf{y}_1,...,\\mathbf{y}_t)=\n",
    "\\mathcal{N}(\\mathbf{y}_{t+1}|\\mathbf{C}\\mathbf{x}_{t+1}, \\mathbf{R})\n",
    "\\mathcal{N}(\\mathbf{x}_{t+1}|\\mathbf{A}\\mathbf{\\mu}_t, \\mathbf{A}\\mathbf{\\Sigma}\\mathbf{A}^T + \\mathbf{Q}).\n",
    "$\n",
    "\n",
    "\n",
    "#### 3)\n",
    "Let us introduce temporary notation\n",
    "\n",
    "$\n",
    "\\mathbf{\\mu}_{t+1} = \\mathbf{A}\\mathbf{\\mu}_t,\\\\\n",
    "\\mathbf{\\Sigma}_{t+1} = \\mathbf{A}\\mathbf{\\Sigma}\\mathbf{A}^T + \\mathbf{Q},\\\\\n",
    "p(\\mathbf{x}_{t+1}|\\mathbf{y}_1,...,\\mathbf{y}_t) = \\mathcal{N}(\\mathbf{\\mu}_{t+1}, \\mathbf{\\Sigma}_{t+1})\n",
    "$\n",
    "\n",
    "Then $\\mathbf{y}_{t+1}$ is obtained in terms of the parameters of distribution $p(\\mathbf{x}_{t+1}|\\mathbf{y}_1,...,\\mathbf{y}_t)$ following the same steps as question 1), with the result\n",
    "\n",
    "$\n",
    "p(\\mathbf{y}_{t+1}|\\mathbf{y}_1,...,\\mathbf{y}_t)=\n",
    "\\mathcal{N}(\\mathbf{y}_{t+1}|\\mathbf{C}\\mathbf{\\mu}_{t+1}, \\mathbf{C}\\mathbf{\\Sigma}_{t+1}\\mathbf{C}^T + \\mathbf{R})=\n",
    "\\mathcal{N}\\left(\\mathbf{y}_{t+1}|\\mathbf{C}\\mathbf{A}\\mathbf{\\mu}_t, \\mathbf{C}(\\mathbf{A}\\mathbf{\\Sigma}\\mathbf{A}^T+ \\mathbf{Q})\\mathbf{C}^T + \\mathbf{R}\\right).\n",
    "$\n",
    "\n",
    "The required conditional distribution is then obtained as\n",
    "\n",
    "$\n",
    "p(\\mathbf{x}_{t+1}|\\mathbf{y}_1,...,\\mathbf{y}_t, \\mathbf{y}_{t+1})=\n",
    "\\frac{p(\\mathbf{y}_{t+1}, \\mathbf{x}_{t+1}| \\mathbf{y}_1,...,\\mathbf{y}_t)}\n",
    "{p(\\mathbf{y}_{t+1}| \\mathbf{y}_1,...,\\mathbf{y}_t)}=\n",
    "\\frac{\\mathcal{N}(\\mathbf{y}_{t+1}|\\mathbf{C}\\mathbf{x}_{t+1}, \\mathbf{R})\n",
    "\\mathcal{N}(\\mathbf{x}_{t+1}|\\mathbf{A}\\mathbf{\\mu}_t, \\mathbf{A}\\mathbf{\\Sigma}\\mathbf{A}^T + \\mathbf{Q})}\n",
    "{\\mathcal{N}\\left(\\mathbf{y}_{t+1}|\\mathbf{C}\\mathbf{A}\\mathbf{\\mu}_t, \\mathbf{C}(\\mathbf{A}\\mathbf{\\Sigma}\\mathbf{A}^T + \\mathbf{Q})\\mathbf{C}^T + \\mathbf{R}\\right)}\n",
    "$\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.6\n",
    "\n",
    "The standard definition of variance is \n",
    "\n",
    "$\n",
    "\\mathbb{V}_X[x] = \\mathbb{E}_X[(x-\\mu)^2],\n",
    "$\n",
    "\n",
    "where\n",
    "\n",
    "$\n",
    "\\mu = \\mathbb{E}_X[x].\n",
    "$\n",
    "\n",
    "Using the properties of average we can write:\n",
    "\n",
    "$\n",
    "\\mathbb{V}_X[x] = \\mathbb{E}_X[(x-\\mu)^2] = \\mathbb{E}_X[x^2 - 2x\\mu +\\mu^2] = \\mathbb{E}_X[x^2] - \\mathbb{E}_X[2x\\mu] + \\mathbb{E}_X[\\mu^2]=\\\\\n",
    "\\mathbb{E}_X[x^2] - 2\\mu\\mathbb{E}_X[x] + \\mu^2 = \\mathbb{E}_X[x^2] - 2\\mu^2 + \\mu^2 = \\mathbb{E}_X[x^2] - \\mu^2 \n",
    "$\n",
    "\n",
    "By substituting to this equation the definition of $\\mu$, we obtain the desired equation\n",
    "\n",
    "$\n",
    "\\mathbb{V}_X[x] = \\mathbb{E}_X[(x-\\mu)^2] = \\mathbb{E}_X[x^2] - (\\mathbb{E}_X[x])^2 \n",
    "$\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.7\n",
    "\n",
    "Let is expand the square in the left-hand side of (6.45)\n",
    "\n",
    "$\n",
    "\\frac{1}{N^2}\\sum_{i,j=1}^N(x_i - x_j)^2 =\n",
    "\\frac{1}{N^2}\\sum_{i,j=1}^N(x_i^2 - 2x_i x_j + x_j^2) =\n",
    "\\frac{1}{N^2}\\sum_{i,j=1}^N x_i^2 - 2\\frac{1}{N^2}\\sum_{i,j=1}^N x_i x_j + \\frac{1}{N^2}\\sum_{i,j=1}^Nx_j^2\n",
    "$\n",
    "\n",
    "We see that the first and the last term differ only by the summation index, i.e. they are identical:\n",
    "$\n",
    "\\frac{1}{N^2}\\sum_{i,j=1}^N x_i^2 + \\frac{1}{N^2}\\sum_{i,j=1}^Nx_j^2= 2\\frac{1}{N^2}\\sum_{i,j=1}^N x_i^2 = 2\\frac{1}{N}\\sum_{i=1}^N x_i^2,\n",
    "$\n",
    "\n",
    "since summation over $j$ gives factor $N$.\n",
    "\n",
    "The remaining term can be written as\n",
    "\n",
    "$\n",
    "2\\frac{1}{N^2}\\sum_{i,j=1}^N x_i x_j = \n",
    "2\\frac{1}{N^2}\\sum_{i=1}^N x_i \\sum_{i=1}^N x_j =\n",
    "2\\left(\\frac{1}{N}\\sum_{i=1}^N x_i\\right)^2,\n",
    "$\n",
    "\n",
    "where we again used the fact that the sum is invariant to the index of summation. \n",
    "\n",
    "We thus have proved the required relation that\n",
    "\n",
    "$\n",
    "\\frac{1}{N^2}\\sum_{i,j=1}^N(x_i - x_j)^2 =\n",
    "2\\frac{1}{N}\\sum_{i=1}^N x_i^2 - 2\\left(\\frac{1}{N}\\sum_{i=1}^N x_i\\right)^2\n",
    "$\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.8\n",
    "Bernoulli distribution is given by\n",
    "\n",
    "$\n",
    "p(x|\\mu) = \\mu^x (1-\\mu)^{1-x}\n",
    "$\n",
    "\n",
    "We can use relation\n",
    "\n",
    "$\n",
    "a^x = e^{x\\log a}\n",
    "$\n",
    "\n",
    "to write the Bernoulli distribution as\n",
    "\n",
    "$\n",
    "p(x|\\mu) = e^{x\\log\\mu + (1-x)\\log(1-\\mu)}=\n",
    "e^{x\\log\\left(\\frac{\\mu}{1-\\mu}\\right) + \\log(1-\\mu)} =\n",
    "h(x)e^{\\theta x - A(\\theta)},\n",
    "$\n",
    "\n",
    "where the last equation is the definition of a single-parameter distribution from the exponential family, in which \n",
    "\n",
    "$\n",
    "h(x) = 1,\\\\\n",
    "\\theta = \\log\\left(\\frac{\\mu}{1-\\mu}\\right) \\leftrightarrow \\mu = \\frac{e^\\theta}{1+e^\\theta},\\\\\n",
    "A(\\theta) = -\\log(1-\\mu) = \\log(1+e^\\theta)\n",
    "$\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.9\n",
    "\n",
    "The binomial distribution can be transformed as\n",
    "\n",
    "$\n",
    "p(x|N,\\mu) = {N\\choose x} \\mu^x (1-\\mu)^{N-x} =\n",
    "{N \\choose x} e^{x\\log\\mu + (N-x)\\log(1-\\mu)}=\n",
    "{N \\choose x}e^{x\\log\\left(\\frac{\\mu}{1-\\mu}\\right) +N\\log(1-\\mu)} =\n",
    "h(x)e^{x\\theta - A(\\theta)}\n",
    "$\n",
    "\n",
    "where\n",
    "\n",
    "$\n",
    "h(x) = {N \\choose x},\\\\\n",
    "\\theta = \\log\\left(\\frac{\\mu}{1-\\mu}\\right),\\\\\n",
    "A(\\theta) = -N\\log(1-\\mu) = N\\log(1+e^\\theta)\n",
    "$\n",
    "\n",
    "i.e., the binomial distribution can be represented as an exponential family distribution(only $\\mu$ is treated here as a parameter, since the number of trials $N$ is fixed.)\n",
    "\n",
    "Similarly, the beta distribution can be transoformed as\n",
    "\n",
    "$\n",
    "p(x |\\alpha, \\beta) = \\frac{1}{\\mathcal{B}(\\alpha,\\beta)} x^{\\alpha-1}(1-x)^{\\beta-1} =\n",
    "e^{(\\alpha -1)\\log x + (\\beta -1)\\log(1-x) - \\log(\\mathcal{B}(\\alpha,\\beta))}=\n",
    "h(x)e^{\\theta_1\\phi_1(x) + \\theta_2\\phi_2(x) -A(\\theta_1, \\theta_2)}\n",
    "$\n",
    "\n",
    "where\n",
    "\n",
    "$\n",
    "h(x) = 1,\\\\ \n",
    "\\theta_1 = \\alpha-1, \\theta_2 = \\beta-1,\\\\\n",
    "\\phi_1(x) = \\log x, \\phi_2(x) = \\log(1-x),\\\\\n",
    "A(\\theta_1, \\theta_2) = \\log(\\mathcal{B}(\\alpha,\\beta)) = \\log(\\mathcal{B}(1+\\theta_1,1 + \\theta_2))\n",
    "$\n",
    "\n",
    "i.e. this is a distribution form the exponential family.\n",
    "\n",
    "The product of the two distributions is then given by\n",
    "\n",
    "$\n",
    "p(x|N,\\mu) p(x|\\alpha, \\beta)=\n",
    "{N \\choose x}e^{x\\log\\left(\\frac{\\mu}{1-\\mu}\\right) + (\\alpha-1)\\log x + (\\beta -1)\\log(1-x) + N\\log(1-\\mu) - \\log(\\mathcal{B}(\\alpha,\\beta))}= h(x) e^{\\theta_1 \\phi_1(x) + \\theta_2 \\phi_2(x) + \\theta_3\\phi_3(x) - A(\\theta_1, \\theta_2, \\theta_3)}\n",
    "$\n",
    "\n",
    "where\n",
    "\n",
    "$\n",
    "h(x) = {N \\choose x},\\\\ \n",
    "\\theta_1 = \\alpha-1, \\theta_2 = \\beta-1,\\theta_3 = \\log\\left(\\frac{\\mu}{1-\\mu}\\right)\\\\\n",
    "\\phi_1(x) = \\log x, \\phi_2(x) = \\log(1-x), \\phi_3(x) = x\\\\\n",
    "A(\\theta_1, \\theta_2, \\theta_3) = \\log(\\mathcal{B}(\\alpha,\\beta)) -N\\log(1-\\mu) = \\log(\\mathcal{B}(1+\\theta_1,1 + \\theta_2)) + N\\log(1+e^\\theta_3)\n",
    "$\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.10\n",
    "\n",
    "### a)\n",
    "The two normal distributions are given by\n",
    "\n",
    "$\n",
    "\\mathcal{N}(\\mathbf{x}|\\mathbf{a}, \\mathbf{A}) =\n",
    "(2\\pi)^{-\\frac{D}{2}}|\\mathbf{A}|^{-\\frac{1}{2}} \n",
    "\\exp\\left[-\\frac{1}{2}(\\mathbf{x} - \\mathbf{a})^T\\mathbf{A}^{-1}(\\mathbf{x} - \\mathbf{a})\\right],\\\\\n",
    "\\mathcal{N}(\\mathbf{x}|\\mathbf{b}, \\mathbf{B}) =\n",
    "(2\\pi)^{-\\frac{D}{2}}|\\mathbf{B}|^{-\\frac{1}{2}} \n",
    "\\exp\\left[-\\frac{1}{2}(\\mathbf{x} - \\mathbf{b})^T\\mathbf{B}^{-1}(\\mathbf{x} - \\mathbf{b})\\right]\n",
    "$\n",
    "\n",
    "their product is\n",
    "\n",
    "$\n",
    "\\mathcal{N}(\\mathbf{x}|\\mathbf{a}, \\mathbf{A}) \\mathcal{N}(\\mathbf{x}|\\mathbf{b}, \\mathbf{B}) =\n",
    "(2\\pi)^{-D}|\\mathbf{A}\\mathbf{B}|^{-\\frac{1}{2}} \n",
    "\\exp\\left\\{-\\frac{1}{2}\\left[(\\mathbf{x} - \\mathbf{a})^T\\mathbf{A}^{-1}(\\mathbf{x} - \\mathbf{a})+(\\mathbf{x} - \\mathbf{b})^T\\mathbf{B}^{-1}(\\mathbf{x} - \\mathbf{b})\\right]\\right\\}\n",
    "$\n",
    "\n",
    "The expression in the exponent can be written as\n",
    "\n",
    "$\n",
    "\\Phi = (\\mathbf{x} - \\mathbf{a})^T\\mathbf{A}^{-1}(\\mathbf{x} - \\mathbf{a})+(\\mathbf{x} - \\mathbf{b})^T\\mathbf{B}^{-1}(\\mathbf{x} - \\mathbf{b})=\\\\\n",
    "\\mathbf{x}^T\\mathbf{A}^{-1}\\mathbf{x} - \\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{x} - \\mathbf{x}^T\\mathbf{A}^{-1}\\mathbf{a}+ \\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a}+\n",
    "\\mathbf{x}^T\\mathbf{B}^{-1}\\mathbf{x} - \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{x} - \\mathbf{x}^T\\mathbf{B}^{-1}\\mathbf{b}+ \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b}=\\\\\n",
    "\\mathbf{x}^T(\\mathbf{A}^{-1}+\\mathbf{B}^{-1})\\mathbf{x}- \n",
    "(\\mathbf{a}^T\\mathbf{A}^{-1} + \\mathbf{b}^T\\mathbf{B}^{-1})\\mathbf{x}- \n",
    "\\mathbf{x}^T(\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{B}^{-1}\\mathbf{b})+ \\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b}\n",
    "$\n",
    "\n",
    "we now introduce notation\n",
    "\n",
    "$\n",
    "\\mathbf{C}^{-1} = (\\mathbf{A}^{-1}+\\mathbf{B}^{-1}),\\\\\n",
    "\\mathbf{c} = \\mathbf{C}(\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{B}^{-1}\\mathbf{b}),\\\\\n",
    "\\mathbf{c}^T = (\\mathbf{a}^T\\mathbf{A}^{-1} + \\mathbf{b}^T\\mathbf{B}^{-1})C\\text{   (This can be checked by transposing the previous equation)}\n",
    "$\n",
    "\n",
    "The expression in the exponent now takes form\n",
    "\n",
    "$\n",
    "\\Phi=\n",
    "\\mathbf{x}^T\\mathbf{C}^{-1}\\mathbf{x} - \\mathbf{c}^T\\mathbf{C}^{-1}\\mathbf{x} - \\mathbf{x}^T\\mathbf{C}^{-1}\\mathbf{c}+ \\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b}=\\\\\n",
    "\\mathbf{x}^T\\mathbf{C}^{-1}\\mathbf{x} - \\mathbf{c}^T\\mathbf{C}^{-1}\\mathbf{x} - \\mathbf{x}^T\\mathbf{C}^{-1}\\mathbf{c}+ \\mathbf{c}^T\\mathbf{C}^{-1}\\mathbf{c} + \\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b}- \\mathbf{c}^T\\mathbf{C}^{-1}\\mathbf{c}=\\\\\n",
    "(\\mathbf{x} - \\mathbf{c})^T\\mathbf{C}^{-1}(\\mathbf{x} - \\mathbf{c})+\n",
    "\\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b} - \\mathbf{c}^T\\mathbf{C}^{-1}\\mathbf{c}\n",
    "$\n",
    "\n",
    "where we have completed the square.\n",
    "\n",
    "The product of the two probability distributions can be now written as\n",
    "\n",
    "$\n",
    "\\mathcal{N}(\\mathbf{x}|\\mathbf{a}, \\mathbf{A}) \\mathcal{N}(\\mathbf{x}|\\mathbf{b}, \\mathbf{B}) =\n",
    "(2\\pi)^{-D}|\\mathbf{A}\\mathbf{B}|^{-\\frac{1}{2}} \n",
    "\\exp\\left\\{-\\frac{1}{2}\\left[(\\mathbf{x} - \\mathbf{c})^T\\mathbf{C}^{-1}(\\mathbf{x} - \\mathbf{c})+\n",
    "\\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b} - \\mathbf{c}^T\\mathbf{C}^{-1}\\mathbf{c}\n",
    "\\right]\\right\\}=\\\\\n",
    "(2\\pi)^{-\\frac{D}{2}}|\\mathbf{C}|^{-\\frac{1}{2}} \n",
    "\\exp\\left[-\\frac{1}{2}(\\mathbf{x} - \\mathbf{c})^T\\mathbf{C}^{-1}(\\mathbf{x} - \\mathbf{c})\\right]\n",
    "\\times\n",
    "(2\\pi)^{-\\frac{D}{2}}\\frac{|\\mathbf{A}\\mathbf{B}|^{-\\frac{1}{2}}}{|\\mathbf{C}|^{-\\frac{1}{2}}}\n",
    "\\exp\\left\\{-\\frac{1}{2}\\left[\n",
    "\\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b} - \\mathbf{c}^T\\mathbf{C}^{-1}\\mathbf{c}\n",
    "\\right]\\right\\}=\\\\\n",
    "c\\mathcal{N}(\\mathbf{c}|\\mathbf{c}, \\mathbf{C}),\n",
    "$\n",
    "\n",
    "where we defined\n",
    "\n",
    "$\n",
    "c = (2\\pi)^{-\\frac{D}{2}}\\frac{|\\mathbf{A}\\mathbf{B}|^{-\\frac{1}{2}}}{|\\mathbf{C}|^{-\\frac{1}{2}}}\n",
    "\\exp\\left\\{-\\frac{1}{2}\\left[\n",
    "\\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b} - \\mathbf{c}^T\\mathbf{C}^{-1}\\mathbf{c}\n",
    "\\right]\\right\\}\n",
    "$\n",
    "\n",
    "We now can used the properties that a) the determinant of a matrix product is product of the determinants, and b) determinant of a matrix inverse is the inverse of the determinant of this matrix, and write\n",
    "\n",
    "$\n",
    "\\frac{|\\mathbf{A}||\\mathbf{B}|}{|\\mathbf{C}|}=\n",
    "|\\mathbf{A}||\\mathbf{C}^{-1}||\\mathbf{B}|=\n",
    "|\\mathbf{A}\\mathbf{C}^{-1}\\mathbf{B}|=\n",
    "|\\mathbf{A}(\\mathbf{A}^{-1} + \\mathbf{B}^{-1})\\mathbf{B}|=\n",
    "|\\mathbf{A} + \\mathbf{B}|\n",
    "$\n",
    "\n",
    "For the expression in the exponent we can write\n",
    "\n",
    "$\n",
    "\\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b} - \\mathbf{c}^T\\mathbf{C}^{-1}\\mathbf{c}= \\\\\n",
    "\\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b}-\n",
    "(\\mathbf{a}^T\\mathbf{A}^{-1} + \\mathbf{b}^T\\mathbf{B}^{-1})(\\mathbf{A}^{-1} + \\mathbf{B}^{-1})^{-1}\n",
    "(\\mathbf{A}^{-1}\\mathbf{a} + \\mathbf{B}^{-1}\\mathbf{b})=\\\\\n",
    "\\mathbf{a}^T\\left[\\mathbf{A}^{-1} - \\mathbf{A}^{-1}(\\mathbf{A}^{-1} + \\mathbf{B}^{-1})\\mathbf{A}^{-1}\\right]\\mathbf{a}+ \n",
    "\\mathbf{b}^T\\left[\\mathbf{B}^{-1} - \\mathbf{B}^{-1}(\\mathbf{A}^{-1} + \\mathbf{B}^{-1})\\mathbf{B}^{-1}\\right]\\mathbf{b}-\n",
    "\\mathbf{a}^T\\mathbf{A}^{-1}(\\mathbf{A}^{-1} + \\mathbf{B}^{-1})^{-1}\n",
    "\\mathbf{B}^{-1}\\mathbf{b}-\n",
    "\\mathbf{b}^T\\mathbf{B}^{-1}(\\mathbf{A}^{-1} + \\mathbf{B}^{-1})^{-1}\n",
    "\\mathbf{A}^{-1}\\mathbf{a}\n",
    "$\n",
    "\n",
    "Using the property $(\\mathbf{A}\\mathbf{B})^{-1} = \\mathbf{B}^{-1}\\mathbf{A}^{-1}$ we obtain\n",
    "\n",
    "$\n",
    "\\mathbf{A}^{-1}(\\mathbf{A}^{-1} + \\mathbf{B}^{-1})^{-1}\n",
    "\\mathbf{B}^{-1}=\n",
    "\\left[\\mathbf{B}(\\mathbf{A}^{-1} + \\mathbf{B}^{-1})\\mathbf{A}\\right]^{-1}=\n",
    "(\\mathbf{A} + \\mathbf{B})^{-1}\n",
    "$\n",
    "\n",
    "and\n",
    "\n",
    "$\n",
    "\\mathbf{A}^{-1} - \\mathbf{A}^{-1}(\\mathbf{A}^{-1} + \\mathbf{B}^{-1})\\mathbf{A}^{-1}=\n",
    "\\mathbf{A}^{-1}\\left[1 - (\\mathbf{A}^{-1} + \\mathbf{B}^{-1})\\mathbf{A}^{-1}\\right]=\n",
    "\\mathbf{A}^{-1}\\left[1 - \\mathbf{B}(\\mathbf{A} + \\mathbf{B})^{-1}\\mathbf{A}\\mathbf{A}^{-1}\\right]=\n",
    "\\mathbf{A}^{-1}\\left[1 - \\mathbf{B}(\\mathbf{A} + \\mathbf{B})^{-1}\\right]=\n",
    "\\mathbf{A}^{-1}\\left[(\\mathbf{A} + \\mathbf{B}) - \\mathbf{B}\\right](\\mathbf{A} + \\mathbf{B})^{-1}=\n",
    "(\\mathbf{A} + \\mathbf{B})^{-1}\n",
    "$\n",
    "\n",
    "we thus conclude that\n",
    "\n",
    "$\n",
    "c = (2\\pi)^{-\\frac{D}{2}}|\\mathbf{A}+\\mathbf{B}|^{-\\frac{1}{2}}\n",
    "\\exp\\left\\{-\\frac{1}{2}(\n",
    "\\mathbf{a} - \\mathbf{b})^T(\\mathbf{A} + \\mathbf{B})^{-1}(\\mathbf{a} -\\mathbf{b})\\right\\}=\n",
    "\\mathcal{N}(\\mathbf{b}|\\mathbf{a}, \\mathbf{A}+ \\mathbf{B})=\n",
    "\\mathcal{N}(\\mathbf{a}|\\mathbf{b}, \\mathbf{A}+ \\mathbf{B})\n",
    "$\n",
    "\n",
    "### b)\n",
    "Multivariate normal distribution, $\\mathcal{N}(\\mathbf{x}|\\mathbf{a},\\mathbf{A})$ can be represented as a distribution from an exponential family:\n",
    "\n",
    "$\n",
    "\\mathcal{N}(\\mathbf{x}|\\mathbf{a},\\mathbf{A})=\n",
    "(2\\pi)^{-\\frac{D}{2}}|\\mathbf{A}|^{-\\frac{1}{2}} \n",
    "\\exp\\left[-\\frac{1}{2}(\\mathbf{x} - \\mathbf{a})^T\\mathbf{A}^{-1}(\\mathbf{x} - \\mathbf{a})\\right]=\\\\\n",
    "(2\\pi)^{-\\frac{D}{2}}\n",
    "\\exp\\left[-\\frac{1}{2}\\text{tr}(\\mathbf{A}^{-1}\\mathbf{x}\\mathbf{x}^T) + \\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{x}-\n",
    "\\frac{1}{2}\\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} - \\frac{1}{2}\\log|\\mathbf{A}|\n",
    "\\right],\n",
    "$\n",
    "\n",
    "where we used that $\\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{x} = \\mathbf{x}^T\\mathbf{A}^{-1}\\mathbf{a}$, and also write the first term as\n",
    "\n",
    "$\n",
    "\\mathbf{x}^T\\mathbf{A}^{-1}\\mathbf{x}= \\sum_{i,j}x_i (\\mathbf{A}^{-1})_{ij} x_j= \\sum_{i,j}(\\mathbf{A}^{-1})_{ij} x_j x_i= \\sum_{i,j}(\\mathbf{A}^{-1})_{ij} (\\mathbf{x}\\mathbf{x}^T)_{ji}= \\text{tr}(\\mathbf{A}^{-1}\\mathbf{x}\\mathbf{x}^T)\n",
    "$\n",
    "\n",
    "Representing $\\mathcal{N}(\\mathbf{x}|\\mathbf{b},\\mathbf{B})$ in a similar way and multiplying the two distributions we readily obtain\n",
    "\n",
    "$\n",
    "\\mathcal{N}(\\mathbf{x}|\\mathbf{a},\\mathbf{A})\\mathcal{N}(\\mathbf{x}|\\mathbf{b},\\mathbf{B})=\n",
    "(2\\pi)^{-D}\n",
    "\\exp\\left\\{-\\frac{1}{2}\\text{tr}\\left[(\\mathbf{A}^{-1}+ \\mathbf{B}^{-1})\\mathbf{x}\\mathbf{x}^T\\right]+ \n",
    "(\\mathbf{a}^T\\mathbf{A}^{-1}+\\mathbf{b}^T\\mathbf{B}^{-1})\\mathbf{x}-\n",
    "\\frac{1}{2}\\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} - \\frac{1}{2}\\log|\\mathbf{A}|-\n",
    "\\frac{1}{2}\\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b} - \\frac{1}{2}\\log|\\mathbf{B}|\n",
    "\\right\\}=\\\\\n",
    "c\\mathcal{N}(\\mathbf{x}|\\mathbf{c},\\mathbf{C}),\n",
    "$\n",
    "\n",
    "where we defined\n",
    "\n",
    "$\n",
    "\\mathbf{C}^{-1} = \\mathbf{A}^{-1}+ \\mathbf{B}^{-1},\\\\\n",
    "\\mathbf{c}^T\\mathbf{C}^{-1} = \\mathbf{a}^T\\mathbf{A}^{-1}+\\mathbf{b}^T\\mathbf{B}^{-1},\\\\\n",
    "c = (2\\pi)^{-\\frac{D}{2}}\n",
    "\\exp\\left\\{\\frac{1}{2}\\mathbf{c}^T\\mathbf{C}^{-1}\\mathbf{c} + \\frac{1}{2}\\log|\\mathbf{C}|-\n",
    "\\frac{1}{2}\\mathbf{a}^T\\mathbf{A}^{-1}\\mathbf{a} - \\frac{1}{2}\\log|\\mathbf{A}|-\n",
    "\\frac{1}{2}\\mathbf{b}^T\\mathbf{B}^{-1}\\mathbf{b} - \\frac{1}{2}\\log|\\mathbf{B}|\n",
    "\\right\\}\n",
    "$\n",
    "\n",
    "Coefficient $c$ can now be reduced to the required form using the matrix transformations described in part a).\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.11\n",
    "\n",
    "The expectation value and the conditional expectation value are given by\n",
    "\n",
    "$\n",
    "\\mathbb{E}_X[x] = \\int x p(x) dx,\\\\\n",
    "\\mathbb{E}_Y[f(y)] = \\int f(y) p(y) dy,\\\\\n",
    "\\mathbb{E}_X[x|y] = \\int x p(x|y) dx\n",
    "$\n",
    "\n",
    "We then have\n",
    "\n",
    "$\n",
    "\\mathbb{E}_Y\\left[\\mathbb{E}_X[x|y]\\right] =\n",
    "\\int \\mathbb{E}_X[x|y] p(y) dy =\n",
    "\\int \\left[\\int xp(x|y)dx\\right]p(y) dy =\n",
    "\\int \\int xp(x|y)p(y)dx dy =\n",
    "\\int\\int xp(x,y)dxdy =\n",
    "\\int x\\left[\\int p(x,y) dy\\right] dx =\n",
    "\\int x p(x) dx =\n",
    "\\mathbb{E}_X[x],\n",
    "$\n",
    "\n",
    "where we used the definition fo the conditional probability density\n",
    "\n",
    "$\n",
    "p(x|y)p(y) = p(x,y)\n",
    "$\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.12\n",
    "\n",
    "### a)\n",
    "If $\\mathbf{x}$ is fixed, then $\\mathbf{y}$ has the same distribution as $\\mathbf{w}$, but with the mean shifter by $\\mathbf{A}\\mathbf{x} + \\mathbf{b}$, that is\n",
    "\n",
    "$\n",
    "p(\\mathbf{y}|\\mathbf{x}) = \\mathcal{N}(\\mathbf{y}|\\mathbf{A}\\mathbf{x} + \\mathbf{b}, \\mathbf{Q})\n",
    "$\n",
    "\n",
    "### b)\n",
    "Let us consider random variable $\\mathbf{u} = \\mathbf{A}\\mathbf{x}$, it is distributed according to\n",
    "\n",
    "$\n",
    "p(\\mathbf{u}) = \\mathcal{N}(\\mathbf{u}|\\mathbf{A}\\mathbf{\\mu}_x, \\mathbf{A}\\mathbf{\\Sigma}_x\\mathbf{A}^T).\n",
    "$\n",
    "\n",
    "Then $\\mathbf{y}$ is a sum of two Gaussian random variables $\\mathbf{u}$ and $\\mathbf{w}$ with its mean additionally shifted by $\\mathbf{b}$, that is\n",
    "\n",
    "$\n",
    "p(\\mathbf{y}) = \n",
    "\\mathcal{N}(\\mathbf{y}|\\mathbf{A}\\mathbf{\\mu}_x + \\mathbf{b}, \\mathbf{A}\\mathbf{\\Sigma}_x\\mathbf{A}^T + \\mathbf{Q}),\n",
    "$\n",
    "\n",
    "that is\n",
    "\n",
    "$\n",
    "\\mathbf{\\mu}_y = \\mathbf{A}\\mathbf{\\mu}_x + \\mathbf{b},\\\\\n",
    "\\mathbf{\\Sigma}_y = \\mathbf{A}\\mathbf{\\Sigma}_x\\mathbf{A}^T + \\mathbf{Q}.\n",
    "$\n",
    "\n",
    "### c) \n",
    "Like in b), assuming that $\\mathbf{y}$ is fixed we obtain the conditional distribution\n",
    "\n",
    "$\n",
    "p(\\mathbf{z}|\\mathbf{y}) = \\mathcal{N}(\\mathbf{z}|\\mathbf{C}\\mathbf{y}, \\mathbf{R})\n",
    "$\n",
    "\n",
    "Since $\\mathbf{C}\\mathbf{y}$ is a Gausssian random variable with distribution $\\mathcal{N}(\\mathbf{C}\\mathbf{\\mu}_y, \\mathbf{C}\\mathbf{\\Sigma}_y\\mathbf{C}^T)$ we obtain the distribution of $\\mathbf{z}$ as that of a sum of two Gaussian random variables:\n",
    "\n",
    "$\n",
    "p(\\mathbf{z})=\n",
    "\\mathcal{N}(\\mathbf{z} |\\mathbf{C}\\mathbf{\\mu}_y, \\mathbf{C}\\mathbf{\\Sigma}_y\\mathbf{C}^T + \\mathbf{R})=\n",
    "\\mathcal{N}(\\mathbf{z} |\\mathbf{C}(\\mathbf{A}\\mathbf{\\mu}_x + \\mathbf{b}), \n",
    "\\mathbf{C}(\\mathbf{A}\\mathbf{\\Sigma}_x\\mathbf{A}^T + \\mathbf{Q})\\mathbf{C}^T + \\mathbf{R})\n",
    "$\n",
    "\n",
    "### d) \n",
    "The posterior distribution $p(\\mathbf{x}|\\mathbf{y})$ can be obtained by applying the Bayes' theorem:\n",
    "\n",
    "$\n",
    "p(\\mathbf{x}|\\mathbf{y})=\n",
    "\\frac{p(\\mathbf{y}|\\mathbf{x})p(\\mathbf{x})}{p(\\mathbf{y})}=\n",
    "\\frac{\\mathcal{N}(\\mathbf{y}|\\mathbf{A}\\mathbf{x} + \\mathbf{b}, \\mathbf{Q})\\mathcal{N}(\\mathbf{x}|\\mathbf{\\mu}_x,\\mathbf{\\Sigma}_x)}\n",
    "{\\mathcal{N}(\\mathbf{y}|\\mathbf{A}\\mathbf{\\mu}_x + \\mathbf{b}, \\mathbf{A}\\mathbf{\\Sigma}_x\\mathbf{A}^T + \\mathbf{Q})}\n",
    "$\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.13\n",
    "\n",
    "Cdf is related to pdf as\n",
    "\n",
    "$\n",
    "F_x(x) = \\int_{-\\infty}^xdx' f_x(x'),\\\\\n",
    "\\frac{d}{dx} F_x(x) = f_x(x)\n",
    "$\n",
    "\n",
    "and changes in the interval $[0,1]$.\n",
    "\n",
    "The pdf of variable $y=F_x(x)$ then can be defined as\n",
    "\n",
    "$\n",
    "f_y(y) = f_x(x) \\left|\\frac{dx}{dy}\\right| = \\frac{f_x(x)}{\\left|\\frac{dy}{dx}\\right|} = \n",
    "\\frac{f_x(x)}{\\left|\\frac{dF_x(x)}{dx}\\right|} =\n",
    "\\frac{f_x(x)}{f_x(x)} = 1,\n",
    "$\n",
    "\n",
    "i.e. $y$ is uniformly distributed in interval $[0,1]$."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
